{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "90ebfa19",
   "metadata": {},
   "source": [
    "# Fine-Tuning Llama 2 With Custom Data\n",
    "Llama 2 has gained traction as a robust, powerful family of Large Language Models that can provide compelling responses on a wide range of tasks. While the base 7B, 13B, and 70B models serve as a strong baseline for multiple downstream tasks, they can lack in domain-specific knowledge or proprietary or otherwise sensitive information. Fine-tuning is often used as a means to update a model for a specific task or tasks to better respond to domain-specific prompts. This notebook walks through downloading the Llama 2-7B model from Hugging Face, preparing a custom dataset, and p-tuning the base model against the dataset.\n",
    "\n",
    "**IMPORTANT:** Third party components used as part of this project are subject to their separate legal notices or terms that accompany the components. You are responsible for reviewing and confirming compliance with third-party component license terms and requirements.\n",
    "\n",
    "## Model Preparation\n",
    "The model needs to be downloaded and converted prior to being fine-tuned. The following blocks walk through this process.\n",
    "\n",
    "### Downloading the model\n",
    "First, the 7B variant of Llama 2 needs to be downloaded from Meta. Llama 2 is an open model that is available for commercial use. To download the model, [submit a request on Meta's portal](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) for access to all models in the Llama family. Please note that your Hugging Face account email address MUST match the email you provide on the Meta website, or your request will not be approved.\n",
    "\n",
    "Once approved, use your Hugging Face username and API key to download Llama2 7B (non-chat version) to your workstation where you will be fine-tuning the model. To pull the model files to your local machine, you may navigate on your local machine to the folder you specified as the mount for ```/project/models``` and use a ```git lfs clone https://huggingface.co/<namespace>/<repo-name>``` call to [Meta's HF repository](https://huggingface.co/meta-llama/Llama-2-7b-hf). \n",
    "\n",
    "Once you have the repository cloned locally, you can double check that the model is in the correct location if you can see ```/models/Llama-2-7b-hf``` show up on the left hand side panel of this jupyterlab. Ensure each of the following files are present: \n",
    "\n",
    "* config.json\n",
    "\n",
    "* generation_config.json\n",
    "\n",
    "* pytorch_model-00001-of-00002.bin\n",
    "\n",
    "* pytorch_model-00002-of-00002.bin\n",
    "\n",
    "* pytorch_model.bin.index.json\n",
    "\n",
    "* special_tokens_map.json\n",
    "\n",
    "* tokenizer_config.json\n",
    "\n",
    "* tokenizer.json\n",
    "\n",
    "* tokenizer.model\n",
    "\n",
    "### Converting to .nemo format\n",
    "Prior to fine-tuning the model, it needs to be converted to the .nemo format used by the NeMo Toolkit, which is a set of tools and configs comprising a toolkit for working with conversational AI. The NeMo Framework training container includes a script to convert Llama models to the .nemo format. **The following cell assumes the model has been saved to /models/Llama-2-7b-hf inside the container.**\n",
    "\n",
    "After conversion, a `llama-7b.nemo` file should be present in `/models`. This file will be used for fine-tuning for the remainder of the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a3808ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip3 install --upgrade ipywidgets\n",
    "!python /opt/NeMo/scripts/nlp_language_modeling/convert_hf_llama_to_nemo.py --in-file=../models/Llama-2-7b-hf/ --out-file=../models/llama-7b.nemo\n",
    "!ls -la ../models/llama-7b.nemo"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e79bdbb",
   "metadata": {},
   "source": [
    "## Preparing The Dataset\n",
    "The dataset being used for fine-tuning needs to be converted to a .jsonl file and follow a specific format. In general, question and answer datasets are easiest to work with by providing context (if applicable), a question, and the expected answer, though different downstream tasks work as well.\n",
    "\n",
    "### Downloading the dataset\n",
    "This notebook will use the [Dolly dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k) - available on Hugging Face - as an example. However, this can replaced with other datasets available on Hugging Face as well. If using a dataset not available on Hugging Face, manually upload or download the dataset into `data` directory of the project and follow the steps below.\n",
    "\n",
    "As seen in the output, the dataset is currently in the form of a Hugging Face `DatasetDict` with 15,015 entries each including an `instruction`, `context`, `response`, and `category` field. For custom datasets, ensure the data has a similar structure of one unique item per row and all rows having the same fields. The fields do not need to match that of the Dolly dataset but should all match each other."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7efbc685",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "from omegaconf import OmegaConf\n",
    "import os\n",
    "os.environ['OPENBLAS_NUM_THREADS'] = '8'\n",
    "\n",
    "dataset = load_dataset(\"aisquared/databricks-dolly-15k\")\n",
    "dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63608201",
   "metadata": {},
   "source": [
    "### Preprocessing the dataset\n",
    "Some datasets may contain unnecessary fields. For the example with the Dolly dataset, we do not need the `closed_qa` field for the fine-tuned model and will remove those lines from the dataset. Additionally, the `instruction`, `response`, and `category` fields will be renamed to `question`, `answer`, and `taskname`, respectively. NeMo Toolkit requires p-tuning datasets to include `taskname` to specify which task an element belongs to in the case of multiple tasks. For this case, we will use `genqa` for the taskname for all items. In general, these fields are the most common to work with. Other field names can be used but the configuration step later in this notebook may need to be updated to reflect these custom fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "993fd17e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def set_column(example):\n",
    "    example[\"taskname\"] = 'genqa'\n",
    "    return example\n",
    "\n",
    "dataset = dataset.filter(lambda example: example[\"category\"].startswith(\"closed_qa\"))\n",
    "dataset = dataset.rename_column(\"instruction\", \"question\")\n",
    "dataset = dataset.rename_column(\"response\", \"answer\")\n",
    "dataset = dataset.rename_column(\"category\", \"taskname\")\n",
    "dataset = dataset.map(set_column)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b170ce6",
   "metadata": {},
   "source": [
    "After filtering the dataset above, we can see the first item in the dataset has a `question`, `context`, `answer`, and `taskname` field. Note that the following cell block might not work for custom datasets with different subsets other than `train`, `test`, and `val`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8dfbbba8",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset['train'][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18015cdb",
   "metadata": {},
   "source": [
    "### Split the dataset into train and test files\n",
    "\n",
    "The prompt learning dataset loader accepts a list of json/dictionary objects or a list of json file names where each json file contains a collection of json objects. Each json object must include the field taskname which is a string identifier for the task the data example corresponds to. They should also include one or more fields corresponding to different sections of the discrete text prompt. The input data looks like:\n",
    "```\n",
    "[\n",
    "    {\"taskname\": \"genqa\", \"context\": [CONTEXT_PARAGRAPH_TEXT1], \"question\": [QUESTION_TEXT1], \"answer\": [ANSWER_TEXT1]},\n",
    "    {\"taskname\": \"genqa\", \"context\": [CONTEXT_PARAGRAPH_TEXT2], \"question\": [QUESTION_TEXT2], \"answer\": [ANSWER_TEXT2]},\n",
    "]\n",
    "```\n",
    "\n",
    "To create the files, we need to split the dataset object between train and validation files by taking the first 90% of the object and putting it in a `*train.jsonl` file and the remainder in a `*val.jsonl` file. Note that the split percentage can be changed as well as items being randomly sampled from the dataset to fill the quota if desired."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25a50061",
   "metadata": {},
   "outputs": [],
   "source": [
    "DATA_DIR = \"/project/data\"\n",
    "os.makedirs(DATA_DIR, exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a115881",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "train_test_split = 0.9\n",
    "\n",
    "with open(\"/project/data/dolly_train.jsonl\", 'w') as f:\n",
    "    for index, item in enumerate(dataset['train']):\n",
    "        if index < int(train_test_split*len(dataset['train'])):\n",
    "            f.write(json.dumps(item) + \"\\n\")  \n",
    "\n",
    "with open(\"/project/data/dolly_val.jsonl\", 'w') as f:\n",
    "    for index, item in enumerate(dataset['train']):\n",
    "        if index >= int(train_test_split*len(dataset['train'])):\n",
    "            f.write(json.dumps(item) + \"\\n\")  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ce192ed",
   "metadata": {},
   "source": [
    "Let's take a look at a row in the training dataset file. Similar to the previous dataset output, this shows the fields of `question`, `context`, `answer`, and `taskname`. This format is used for both the `train` and `val` files. These files will be used as datasets for the p-tuning process. The dataset is now ready to be used for p-tuning the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26955785",
   "metadata": {},
   "outputs": [],
   "source": [
    "!head -1 $DATA_DIR/dolly_train.jsonl"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "883dccd9",
   "metadata": {},
   "source": [
    "## Configuring the job\n",
    "With the dataset preparation finished, we need to update the default configuration for our fine-tuning job. The sample config file provided by NeMo is a good template to base our changes on. Let's load the file as an object that we can edit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b810a401",
   "metadata": {},
   "outputs": [],
   "source": [
    "config = OmegaConf.load(\"/opt/NeMo/examples/nlp/language_modeling/tuning/conf/megatron_gpt_peft_tuning_config.yaml\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c0bc002",
   "metadata": {},
   "source": [
    "With the config loaded, we can override certain settings for our environment. Many of the default values shown here would work but some key points are called out below:\n",
    "\n",
    "* `config.trainer.precision=\"32\"` - This is the precision that will be used during p-tuning. The model might be more accurate with higher values but it also uses more memory than lower precisions. If the p-tuning process runs out of memory, try reducing the precision here.\n",
    "* `config.trainer.devices=1` - This is the number of devices that will be used. If running on a multi-GPU system, increase this number as appropriate.\n",
    "* `config.model.restore_from_path=\"/models/llama-7b.nemo\"` - This is the path to the converted `.nemo` checkpoint from the beginning of the notebook. If the path changed, update it here.\n",
    "* `config.model.data.train_ds.file_names` and `config.model.data.validation_ds.file_names` - If a different filename or path was used for the dataset files created earlier, specify the new values here.\n",
    "* `config.model.global_batch_size` - If using a higher GPU count or if additional GPU memory allows, this value can be increased for higher performance. Note that higher batch sizes use more GPU memory.\n",
    "* `config.model.data.train_ds.prompt_template` - If different field names were used during the dataset creation earlier, update them here with the intended field names. This is what NeMo Toolkit will look for in each dataset element."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ba5274b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "config.trainer.precision=\"32\"\n",
    "config.trainer.devices=1\n",
    "config.trainer.num_nodes=1\n",
    "config.trainer.max_epochs=3 \n",
    "config.model.restore_from_path=\"/project/models/llama-7b.nemo\"\n",
    "config.model.peft.peft_scheme=\"ptuning\"\n",
    "config.model.data.train_ds.file_names=[\"/project/data/dolly_train.jsonl\"] \n",
    "config.model.data.validation_ds.file_names=[\"/project/data/dolly_val.jsonl\"]\n",
    "config.model.global_batch_size=2 \n",
    "config.model.micro_batch_size=1 \n",
    "config.model.optim.lr=0.0001\n",
    "config.model.data.train_ds.concat_sampling_probabilities=[1.0] \n",
    "config.model.data.train_ds.prompt_template=\"Context: {context}\\n\\nQuestion: {question}\\n\\nAnswer:{answer}\" \n",
    "config.model.peft.p_tuning.virtual_tokens=15 \n",
    "config.model.data.train_ds.label_key=\"answer\"\n",
    "config.model.data.train_ds.truncation_field=\"context\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8edce694",
   "metadata": {},
   "source": [
    "With the config settings updated, save it as a `.yaml` file that can be read by NeMo Toolkit during p-tuning and save it to the p-tuning configuration directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44751ddc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# OmegaConf.save can also accept a `str` or `pathlib.Path` instance:\n",
    "OmegaConf.save(config, \"llama-config.yaml\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6621db05",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mv llama-config.yaml /opt/NeMo/examples/nlp/language_modeling/tuning/conf/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1058f600",
   "metadata": {},
   "source": [
    "## Launching the job\n",
    "With the model downloaded, the dataset prepped, and the config set, it is now time to launch the p-tuning job! The following block launches the job on the specified number of GPUs. Depending on the size of the dataset and the GPU used, this could take anywhere from a few minutes to several hours to finish. As the model is tuned, checkpoints will be saved in the `nemo_experiments` directory inside the container. These checkpoints contain prompt embeddings which are used to send inference requests with the p-tuned weights to deployed models so they respond as expected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc8f8e1d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_peft_tuning.py \\\n",
    "    --config-name=llama-config.yaml"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65f339ce",
   "metadata": {},
   "source": [
    "### Configuring the evaluation script\n",
    "The configuration file for the evaluation job needs to be updated based on the provided template to reflect the changes for the experiment here. Load the template config and update some of the settings to match the local environment. Note the following settings may differ for custom datasets:\n",
    "* `config.model.data.test_ds.file_names` - List any prediction files that should be used to evaluate the model. In general, it is recommended to have this be different from the training and validation files used during p-tuning. For simplicities sake, we generate a single example here. You may add more if you wish. \n",
    "* `config.model.data.test_ds.names` - Change the name of the dataset used here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d5eb133",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's create our test file. \n",
    "\n",
    "import json\n",
    "test = [{\"question\": \"Is the lawn mower product solar powered?\", \n",
    "         \"context\": \"The Auto Chef Master is a personal kitchen robot that effortlessly turns raw ingredients into gourmet meals with the precision of a Michelin-star chef. The Eco Lawn Mower is a solar powered high-tech lawn mower that provides an eco-friendly and efficient way to maintain your lawn.\", \n",
    "         \"answer\": \"Yes, the Eco Lawn Mower is solar powered.\", \n",
    "         \"taskname\": \"genqa\"}]\n",
    "        \n",
    "with open('/project/data/dolly_test.jsonl', 'w+') as outfile:\n",
    "    for entry in test:\n",
    "        json.dump(entry, outfile)\n",
    "        outfile.write('\\n')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4bae1db4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the template config file\n",
    "config = OmegaConf.load(\"/opt/NeMo/examples/nlp/language_modeling/tuning/conf/megatron_gpt_peft_eval_config.yaml\")\n",
    "\n",
    "# Override required settings\n",
    "config.peft_scheme=\"ptuning\"\n",
    "config.model.restore_from_path=\"/project/models/llama-7b.nemo\"\n",
    "config.model.peft.restore_from_path=\"/project/code/nemo_experiments/megatron_gpt_peft_tuning/checkpoints/megatron_gpt_peft_tuning.nemo\"\n",
    "config.model.data.test_ds.file_names=[\"/project/data/dolly_test.jsonl\"]\n",
    "config.model.data.test_ds.names=\"dolly\"\n",
    "config.model.data.test_ds.global_batch_size=2\n",
    "config.model.data.test_ds.micro_batch_size=1\n",
    "config.model.data.test_ds.write_predictions_to_file=True\n",
    "config.model.data.test_ds.output_file_path_prefix=\"/project/code/predictions\"\n",
    "config.model.data.test_ds.prompt_template=\"Context: {context}\\n\\nQuestion: {question}\\n\\nAnswer: {answer}\"\n",
    "\n",
    "# Save the new config file\n",
    "OmegaConf.save(config, \"llama-eval-config.yaml\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04e4bf88",
   "metadata": {},
   "source": [
    "Once the config is saved, evaluation can be launched below. Depending on the size of the hardware and the number of inference examples, this may take a few minutes to complete. Results will be saved to `code/predictions_test_dolly_inputs_preds_labels.jsonl`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9183388",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mv llama-eval-config.yaml /opt/NeMo/examples/nlp/language_modeling/tuning/conf/\n",
    "!python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_peft_eval.py --config-name=llama-eval-config.yaml"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30760779",
   "metadata": {},
   "source": [
    "Note depending on hardware and the number of examples used, the evaluation script may take a while to run since we are using a training container setting and not currently optimizing for inference. Once we are ready to serve the finetuned model for true deployment, we may then move the model to an optimized inference framework like Triton and/or TensorRT-LLM. \n",
    "\n",
    "After the evaluation script completes, view the results. Keep in mind the results you see may vary in quality for a variety of reasons: \n",
    "* The 7B parameter version of Llama 2 is used in this workflow. 13B and 70B may generate better quality responses. \n",
    "* Hyperparameters can be adjusted in the config files to improve performance. \n",
    "* More or better performing hardware can enable more efficient fine-tuning for more epochs. \n",
    "\n",
    "The point is fine tuning the out-of-the-box model to the general QA task seems to be easy and straightforward with this formula; just imagine what 13B and 70B results will look like!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51bb4c8b",
   "metadata": {},
   "outputs": [],
   "source": [
    "!head -n 1 /project/code/predictions_test_dolly_inputs_preds_labels.jsonl"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1155444c",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
